Offline Learning of Closed-Loop Deep Brain Stimulation Controllers for Parkinson Disease Treatment

Deep brain stimulation (DBS) has shown great promise toward treating motor symptoms caused by Parkinson's disease (PD), by delivering electrical pulses to the Basal Ganglia (BG) region of the brain. However, DBS devices approved by the U.S. Food and Drug Administration (FDA) can only deliver continuous DBS (cDBS) stimuli at a fixed amplitude; this energy inefficient operation reduces battery lifetime of the device, cannot adapt treatment dynamically for activity, and may cause significant side-effects (e.g., gait impairment). In this work, we introduce an offline reinforcement learning (RL) framework, allowing the use of past clinical data to train an RL policy to adjust the stimulation amplitude in real time, with the goal of reducing energy use while maintaining the same level of treatment (i.e., control) efficacy as cDBS. Moreover, clinical protocols require the safety and performance of such RL controllers to be demonstrated ahead of deployments in patients. Thus, we also introduce an offline policy evaluation (OPE) method to estimate the performance of RL policies using historical data, before deploying them on patients. We evaluated our framework on four PD patients equipped with the RC+S DBS system, employing the RL controllers during monthly clinical visits, with the overall control efficacy evaluated by severity of symptoms (i.e., bradykinesia and tremor), changes in PD biomakers (i.e., local field potentials), and patient ratings. The results from clinical experiments show that our RL-based controller maintains the same level of control efficacy as cDBS, but with significantly reduced stimulation energy. Further, the OPE method is shown effective in accurately estimating and ranking the expected returns of RL controllers.


INTRODUCTION
Currently, around 1.05 million individuals in the United States are affected by Parkinson's disease (PD) [44]. Deep brain stimulation (DBS) is an effective treatment to reduce PD symptoms such as tremor and bradykinesia [3,12,13,49]. A DBS system consists of electrodes that are placed into the Basal Ganglia (BG) region of the brain, and a pulse generator implanted in the chest to generate trains of short electrical pulses (see Fig. 1). Existing FDA-approved DBS solutions are limited to continuous DBS (cDBS). These devices are programmed to stimulate at a fixed amplitude, with the specific parameters determined by clinicians through trial-and-error [52]. However, such stimuli usually lead to extensive energy consumption, significantly reducing the battery lifetime of the device. Moreover, over-stimulated patients, even intermittently, may suffer from side-effects such as dyskinesia and speech impairment [5]. As a result, developments of closed-loop DBS controllers that are more responsive to activity and patient state (i.e., context) are of considerable interest to clinicians, patients, and the community.
Existing DBS control methods focus on simply switching on/off the stimulation or scaling up/down its intensity in a proportional control approach, conditioned on the change of specific biomarkers, i.e., when they cross over some pre-determined thresholds [1,2,5,40,41]. Biomarkers include local field potentials (LFPs) and electroencephalography (EEG) from the BG, as well as accelerometery data and electromyography obtained from wearable devices [50]. Though such methods have improved energy efficiency [25,41], they still require substantial efforts to experiment and fine-tune the thresholds for each specific patient. Moreover, the patient may suffer from sub-optimal DBS settings in between clinical visits with arXiv:2302.02477v4 [cs.LG] 16 Mar 2023 poor symptom control due to varying patient state. For example, exercise or fluctuations in medication dosage or timing could affect their PD symptoms and DBS control, so the tuning results may be biased. Consequently, the challenge (I) of developing closed-loop DBS controllers is to ensure that the control policy can perform consistently over diverse and dynamic patient contexts and states.
Reinforcement learning (RL) has shown considerable potential in control over complicated systems [15,21,22,46], and various RL-based approaches have been proposed to facilitate closed-loop DBS [19,23,48,52]. Specifically, several approaches [23,48,52] model EEG and LFP as the state space of the RL environment and use temporal difference learning or fitted Q-iteration to design control policies adapting stimulation amplitudes/frequencies to conserve energy usage. The deep actor-critic based approach proposed in [19] further allows the temporal pattern of the stimuli to be adapted over time, benefiting from the use of deep RL techniques capable of searching in larger state and action space. Although such methods achieve satisfactory control of efficacy and energy savings jointly, they have only been evaluated in simulations, i.e., on computational BG models [30,58]. One may assume that unlimited training data can be obtained from such models, which is contrary to the realworld case where the device programming is done in clinics and the patient only participates sparsely over time.
Another limitation of directly using deep RL methods for realtime DBS control is the computational complexity of evaluating the RL policies in vivo, as they are usually represented by deep neural networks (DNNs) that may require millions of multiplications in a single forward pass. The resource-constrained implantable devices (e.g., Fig. 1) may not support or facilitate such computations. Thus, the challenge (II) of closed-loop DBS is to ensure that the controller can be designed with limited training samples and executed without the need of extensive computing resources. Further, in contrast to simulated or robotic environments where most RL policies can be deployed directly for performance evaluation, the safety and control efficacy of the controllers directly used on patients need to be thoroughly evaluated before each test condition starts [51]. Hence, the challenge (III) of enabling closed-loop DBS therapies in patients is being able to proactively provide accurate estimations of the expected performance of the controllers.
Consequently, in this paper, we first introduce an offline RL framework to address the challenges (I) and (II) above, resulting in a closed-loop DBS system that is both effective (in terms of therapy) and energy-efficient. Specifically, we model the BG regions of the brain as a Markov decision process (MDP), capturing the underlying neuronal activities in response to the stimuli. Then, the deep actor-critic algorithm [39] is adapted to adjust the amplitude of the stimuli according to the changes in LFPs. A total of four patients, equipped with the Medtronic Summit RC+S DBS devices [59], participated in the data collection and testing trials in clinics. Given that the deep actor-critic framework is considered offline RL and can leverage all historically collected trajectories, i.e., experience replay to facilitate optimizing the control policy, we address challenge (I) by varying the level of activities, medications etc. of the patients before and during the trials. Similarly, experience collected from non-RL controllers can also be used to update the policy; for example, in the early stage of learning, a controller that generates uniformly random amplitudes (within some range) can facilitate exploring the Figure 1: An implantable deep brain stimulation (DBS) device. The stimuli, generated by the pulse generator at a given amplitude and frequency, are delivered to the basal ganglia (BG) through multi-contact electrodes. Each electrode has four contacts; two stimulate the BG and two sense local field potentials (LFPs) that may be used for control feedback.
state and action space. We also introduce model distillation/compression [26] techniques specifically for the DBS systems, such that the RL policies can be captured by deep neural networks (DNNs) with significantly fewer nodes, whose forward passes can be executed within the required control rates, addressing challenge (II).
To address challenge (III), we introduce a model-based offline policy evaluation (OPE) method that captures the underlying dynamics of the considered MDP, where the expected returns of the control policy can be estimated by the mean return of the trajectories rolled out from the learned model, without directly deploying the policy to the patient. In each DBS trial, the control efficacy is evaluated from various sources, including LFP biomarkers recorded from the implantable DBS device, patient responses to bradykinesia tests, satisfaction level reported by the patient, and the overall tremor severity quantified from accelerometry data collected by external wearable devices (e.g., smart watch). Note that each of the latter three criteria is only evaluated once at the end of each trial; yet they are imperative for evaluating the control efficacy from the patient's side. These efficacy metrics are thus considered sparsely available compared to the LFPs that can be sensed in each time step, which limits the use of existing OPE methods, including importance sampling (IS) [16,54], distributional correction estimations (DICE) [47], and the model-based OPE [20], as these do not allow for explicitly capturing/modeling such end-of-session rewards. Our OPE method can capture such behaviors through a specially designed architecture and training objective, outperforming existing methods as we show in clinical experiments.
The contributions of this work are three-fold: ( ) to the best of our knowledge, this is the first 'full-stack' offline RL methodology that facilitates both optimizing and evaluating RL-based DBS control policies using historical data; ( ) we developed an RL-based DBS controller whose performance is validated through clinical trials with PD patients, demonstrating reduced energy consumption with non-inferior control efficacy compared to cDBSthis is the first effective closed-loop DBS control that is not an ON/OFF switching, or scaling up/down proportionally, and has been extensively tested in clinic (i.e., on patients); ( ) our OPE method effectively captures the end-of-session rewards, leading to accurate Figure 2: The overall architecture of the RC+S DBS system. The Summit research and development kit (RDK) can be used to configure the Summit program, allowing us to compute the beta amplitude ( ) and execute the RL controller.
estimations of control efficacy using the data collected in clinic; thus, helps demonstrate the effectiveness of the policies to be tested proactively, and can be used to prioritize the policies that could lead to better performance within the limited amount of testing time.
This paper is organized as follows. Sec. 2 provides the basics of DBS, RL, and OPE, before our clinical closed-loop DBS setup is introduced in Sec. 3. In Sec. 4, the offline RL framework is introduced, enabling training and updating RL controllers with historical data. Sec. 5 introduces the model-based OPE approach to estimate performance of RL policies. Sec. 6 presents the results of the experimental evaluations on patients, before concluding remarks in Sec. 7.

PRELIMINARIES AND MOTIVATION
In this section, we first introduce DBS, before presenting in the next section the DBS experimental setup we developed for clinical trials, including sensing, communication and control. Also, preliminaries for offline RL and OPE are briefly introduced; more comprehensive reviews of RL and OPE can be found in [19,20,39,57].

The Need for Closed-Loop DBS
PD is caused by progressive death of dopaminergic neurons in the substantia nigra region of the brain. This change in dopaminergic signaling results in pathological activity in the BG regions targeted by DBS, globus pallidus pars interna (GPi), globus pallidus pars externa (GPe) and subthalamic nucleus (STN); see Fig. 1. Given the reduced number of neurons, the level of dopamine generally decreases in BG, leading to various motor symptoms such as bradykinesia and tremor [7,11,34]. Physiologically, the effect of PD can be captured by the changes in LFPs in GPi, GPe and STN. Specifically, PD can cause abnormal neuron firings in these regions, and lead to increased beta-band (13-35 Hz) amplitude ( ), referred to as the beta amplitude, of the LFPs [20].
Existing research-only DBS devices are capable of capturing the changes in LFPs through the multi-contact electrodes implanted in the BG. As illustrated in Fig. 1, we used 4-contact electrodes placed in the STN and GP regions. Monopolar stimulation was delivered on a single contact on each lead (with the case serving as counterelectrode). The two contacts surrounding the stimulation contact were used for sensing LFPs (i.e., sandwich sensing). Existing devices providing open-loop cDBS stimulate pulses at a fixed amplitude, which in most cases can correct the abnormal neuronal activity [37]. However, constantly stimulating with high amplitudes significantly reduces the battery lifetime of the DBS device and may cause serious side-effects such as speech impairment [4,40,60]. Consequently, it is important to design DBS controllers that are effective (from the control, i.e., therapy, perspective) and energy-efficient.
As discussed in Introduction, current aDBS approaches require considerable time and effort for the patients and their healthcare providers to determine the thresholds through trial-and-error [63]. Several deep-RL-based controllers have been proposed for closedloop DBS, which can adapt the amplitude of the stimulation pulses in real time [19,20] in response to changes in the feedback signals (e.g., ). However, such frameworks are only validated through numerical simulations, i.e., on simplified computational BG models, instead of clinical trials with human participants. In real world, substantial historical experience, or trajectories collected from past interactions between the controller and the environment (patient), may be necessary to learn an RL policy with suitable control efficacy and patient satisfaction [38]. Offline RL holds promise to resolve this challenge, as it can use the data collected from any type of controllers, including cDBS or simply a policy switching between arbitrary stimulation amplitudes/frequencies, to optimize an RL control policy. Moreover, each time before a new control policy is deployed to the patient, the clinicians need to assess its effectiveness and may require justifications toward its estimated control efficacy and performance [51]. OPE can facilitate such use cases, as it is capable of estimating the expected return of RL policies using historical trajectories, bridging the gap between the offline RL training and evaluations. Preliminaries for offline RL and OPE are presented in two subsections below.

Offline Reinforcement Learning
Offline RL has proven useful in many domains, including robotics [18,22], healthcare [21], etc., since it can optimize the control policies without requiring the environment to be presented, which guarantees the safety of the learning process. Further, it does not require the training data to be exclusively collected by the control policy being updated, leading to improved sample efficiency. To facilitate offline RL, the underlying dynamical environments are firstly modeled as Markov decision processes (MDPs). Definition 2.1 (MDP). An MDP is a tuple M = (S, 0 , A, P, , ), where S is a finite set of states; 0 is the initial state; A is a finite set of actions; P is the transition function defined as P : S × A → S; : S × A × S → R is the reward function, and ∈ [0, 1) is a discount factor.
Then, the RL policy : S → A determines the action = ( ) to be taken at a given state . The accumulated return under a policy can be defined as follows.
Definition 2.2 (Accumulated Return). Given an MDP M and a policy , the accumulated return over a finite horizon starting from the stage and ending at stage , for > , is defined as where + is the return at the stage + .
The goal of offline RL can now be defined as follows.
Problem 1 (Offline Reinforcement Learning). Given an MDP M with unknown transition dynamics P, a pre-defined reward function , and a experience replay buffer E = {[( 0 , 0 , 0 , 1 ), . . . , Figure 3: Timeline for training RL-based DBS controllers in clinical studies. Since only limited data can be collected during each clinical visit, offline RL can be used to fine-tune existing or train new controllers using all the historical data. Then, offline policy evaluation (OPE) facilitates choosing the possible top-performing ones to be tested in the next visit.
containing trajectories collected over an unknown behavioral policy , find the target policy * such that the expected accumulative return starting from the initial stage over the entire horizon is maximized, i.e., here, is the state-action visitation distribution under policy .
The deep actor-critic RL framework [39] can be leveraged to solve (2). Other value-based RL methods such as conservative Qlearning [36] and implicit Q-learning [33] could also be considered; however, actor-critic methods can in general reduce the variance of gradient estimations and result in faster convergence [19,45,64]. Here, we specifically consider the deterministic version of actorcritic [39], instead the one producing stochastic policies [24], as it would be easier to demonstrate the effectiveness of deterministic policies in clinics, as well as via OPE methods introduced below. Details on the deep actor-critic algorithm [39] are provided in Appendix C.1.

Offline Policy Evaluation for DBS
OPE allows the use of experience replay buffer to estimate the expected return of RL policies, without the need of deploying them to the environment directly. Fig. 3 illustrates the use case of OPE in the context of DBS clinical testing. Specifically, during phase and , offline RL uses all trajectories collected historically to train RL policies following different hyper-parameters etc. Then, in phase , OPE can be used to estimate and rank the expected return of these policies, where the top-performing ones can be deployed during the next clinic visit (phase ). Consequently, OPE can effectively reduce the number of testing sessions needed, so the policies that show promise attaining better performance can be thoroughly tested within the short time frame. Also, it can demonstrate the effectiveness of the policies to be deployed in clinics.
The goal of OPE can be defined as follows.  the logged trajectories are used to evaluate the performance of deployed RL controllers, as well as training data for further fine-tuning; (2) patient feedback including results from bradykinesia tests and a rating on the scale between 1-10; (3) patient tremor severity captured by wearable devices.
Most existing OPE methods, such as [10,14,16,29,42,54,61,62,65], are heavily based on importance sampling (IS) and could result in inconsistent estimations due to the high variance of the IS weights [10,42]. On the other hand, model-based OPE methods have shown strengths in estimating the expected returns more accurately [14,20], by directly capturing the MDP transitions and rewards. The variational encoding-decoding based deep latent MDP model (DLMM) introduced in [20] is shown to be effective evaluating controlpolicies for a computational BG model. Specifically, DLMM is derived following the variational inference framework from [32]. The basics of DLMM are provided in Appendix C.2, and we refer the readers to [32] for basics of variational inference. In Sec. 5, we extend it toward the clinical use case considered in this work, to allow for including the QoC metrics that can be only evaluated once in each session, such as the bradykinesia results, patient ratings, and tremor severity, which will be available as illustrated in Fig. 4.

DBS SETUP USED IN CLINICAL TRIALS
We build on the research-only Medtronic's Summit RC+S system [59] to enable testing of RL-based controllers in clinical trials. The overall architecture of the RC+S-based system we developed is illustrated in Fig. 2. Specifically, Medtronic provides the code and communication APIs (Summit program), which enable the stimulation amplitude of the pulses delivered by the internal pulse generator (IPG) to be adapted over time. The Summit program is developed using the C# language under the .NET framework, which we extended to execute RL policies leveraging the provided Summit research development kit (RDK), requiring the use of a Windows OS.
Thus, a research tablet is used for the execution of the developed DBS controllers; the desired stimulation amplitude is computed for each control cycle (every 2 seconds) and sent to the IPG over Bluetooth TM , using proprietary communication and security protocols. On the other hand, the IPG transmits to the controller the LFPs captured from the BG, from which the beta amplitude of the LFPs, denoted by , is calculated and used as a quality of control (QoC) metric as well as potential control feedback signals (i.e., inputs to the RL controller). Each clinical trial session lasts 5-20 minutes depending on the schedule of the visit, and multiple controllers can be tested across different sessions. All the computed and stimulation amplitudes applied over time are logged for future training and evaluation purposes, as summarized in Fig. 4. For the developed system design, we obtained the FDA's Investigative Device Exception (IDE) G180280, which has allowed us to perform human experiments according to an Institutional Review Board (IRB) protocol approved by Duke University Medical Center.
In addition to , three other QoC metrics are collected from every patient at the end of each session. Specifically, near the end of each session, the patient is asked to perform 10 seconds of hand grasps (rapid and full extension and close of all fingers) maneuver [55] to evaluate the severity of the possible bradykinesia caused by PD. Such hand motions are captured by a leap motion sensor by Ultraleap [8]. Then, the elapsed time between any two consecutive open fist is captured and recorded by the sensor, after which the grasp frequency can be calculated as here, is the total number of open fists throughout the 10 s test, and ( , +1) is the time spent between the -th and + 1-th grasp. Further, at the end of each session, the patient provides a score between 1-10, with 10 indicating the highest level of satisfaction with the treatment received in the past session, and 1 being the lowest, i.e., The grasp frequency and rating for each session are also recorded, which corresponds to the patient feedback stream in Fig. 4. Throughout all sessions, an Apple watch is worn by the patient at their wrist, where the Apple's movement disorders kit [53] is used to analyze the accelerometry movements, classifying the patient's tremor severity as no-tremor, slight, mild, moderate and strong every 1 minute, following StrivePD's implementation [9]. At the end of each session, an overall tremor severity is recorded as the fraction of time the patient experiencing mild ( ), moderate ( ) or strong ( ) tremor over the entire session with length , i.e., The three data streams are collected from all trial sessions after each clinical visit. Moreover, each time a patient may come into the clinic with slightly different PD conditions (e.g., pathology progression over time), medication prescriptions, activity levels etc.; thus, our goal is to capture impact of such changes by the data collection process, in order to facilitate the training and testing the offline RL and OPE frameworks for DBS.

OFFLINE RL DESIGN OF DBS CONTROLLERS
In this section, we employ offline RL for learning control policies for DBS clinical trials, starting from the formulation of an MDP M capturing the underlying neurological dynamics in the BG, and the policy distillation technique that allows for reducing the computational time and resource needed to evaluate the RL policies (represented by DNNs).

Modeling the BG as an MDP
We now define the elements of an MDP M = (S, 0 , A, P, , ).
State Space S and the Initial State 0 . As discussed in Sec. 2.1 and 3, our DBS controller supports calculation of from LFPs, and the changes in can be used as a biomarker for PD-levels for some patients. Thus, we consider the MDP state, at a discrete time step , as a historical sequence of sampled at a fixed intervals, captured by ∈ Z + , over a sliding queue of size ∈ Z + , i.e., Here, (˜) 's are the evaluated at the elapsed time˜since the clinical trial starts, is configurable in our system design (Fig. 2), and we used = 2 corresponding to calculating every 2 , resulting in 20 time-windows for = 10 elements in the queue; finally, ∈ R is the state at -th (discrete) step of the MDP. The initial state 0 is considered to be the sequence collected right before the clinical trial starts, i.e., from˜= −( − 1) to˜= 0.
Action Space A. The amplitude of DBS stimulation pulses can be changed in pre-defined (discrete) time steps, i.e., every 2 seconds for the developed controllers. We consider the actions as the percentage of the cDBS amplitude determined by clinicians; i.e., ∈ [0, 1] ⊂ R, where = 0 and = 1 correspond to no-DBS and stimulation with the same amplitude as in cDBS, respectively. Transition Dynamics P : S × A → S. Every time after the stimulation amplitude is adjusted following , the system computes the latest (˜+ ) using the LFPs sent back from the IPG; this leads to the MDP state at the (t+1)-th (discrete) step as i.e., the left-most element in (6) is pushed out, with (˜+ ) appended to the right-end. Note that we define the MDP states and actions over discrete time steps, 's, instead the elapsed time˜, for the conciseness of equations and presentations below. Now, the MDP transitions are captured to directly follow +1 ∼ P ( , ).
Reward Function : S × A → R. Following from the setup of the DBS system (Sec. 3), we define the rewards as specifically, if the beta amplitude received at the ( + 1)-th step, (˜+ ) , is less than some threshold , then a non-negative reward is issued along with the term − 1 · ( 1 > 0, 1 ∈ R) penalizing over-usage of large stimulation amplitudes (for better energy efficiency). On the other hand, if (˜+ ) is greater than the threshold , a negative reward will be used to replace above.
Remark 4.1. The reward functions used for RL training do not consider the QoC metrics that are available not at every step of the control execution (i.e., every 2 ) but only at the end of each clinical session, i.e., , , from (3), (4), (5). The reason is that the horizon is usually large and the their coverage can be very sparse. Instead, these QoC metrics serve as great measurements quantifying how well the policies perform, which are thus leveraged by the OPE techniques introduced in Sec. 5.
For the introduced MDP M, we leverage the offline RL framework introduced in Sec. 2.2 to search for the target policy * . Following from Problem 1, it requires an experience replay buffer E that consists of historical trajectories collected over some behavioral policy . At the beginning of offline RL training, exploration of the environment is deemed more important than exploitation [28]. Hence, a controller that generates random actions uniformly from [ , 1] is used to constitute E at earlier stage of clinical trials, where is the lower bound from which the random can be generated, for the sake of patient's safety and acceptance.
Once the RL policies can attain satisfactory overall performance, i.e., quantified as achieving significantly improved QoCs (introduced in Sec. 3) compared to the random controller above, we consider including into E the trajectories obtained from such RL policies. From this point onward, the replay buffer E will be iteratively updated and enriched with the RL-induced trajectories after each trial. Consequently, the behavioral policy can be considered as a mixture of random control policy and several RL policies deployed in past trials in general. With E being defined, the objective for training RL policies, (20), can be optimized using gradient descent [19,20,39].

Policy Distillation
Our system design (Fig. 2) is set to process various tasks in each 2 stimulation (i.e., control) period, facilitating communication between the research tablet and IPG, computing from LFPs, evaluating the RL controller, data logging, and other basic functionalities that ensure the safety and functionality of DBS. Hence, it was critical to reduce the overall computation requirements, such that each task meets the required timings, as well as prolong the battery lifetime. As introduced in Sec. 2.2, the RL policies are parameterized as DNNs; although a forward pass of a DNN would not require as much computational resources as for training (through back-propagation), it may still involve hundreds of thousands of multiplication operations. For example, consider the recommended DNN size as in [39], it takes at least 120,000 multiplications to evaluate a two-layer NN with 400 and 300 nodes each. Hence, we integrate into our system the model/policy distillation techniques [26], allowing smaller sized NNs to be used to parameterize RL policies.
We build on a similar approach as in [56], originally proposed to reduce the size of DNNs used in deep Q-learning [46], which only works for a discrete action space. In particular, our extension allows for the use in the deterministic actor-critic cases considered in this work. Consider the original policy (teacher) parameterized by a DNN with weights . We train a smaller-sized DNN (student) with weights˜to learn 's behavior, by minimizing the mean for all state samples contained in the experience replay ∈ E . We also consider augmenting the data used to optimize (9) to smooth out the learning process. We introduce synthetic states,˜'s, where each˜is generated by adding noise to each dimension of a state sample that is originally in E ; the noise is sampled from a zero-mean Gaussian distribution, ∼ N (0, 2 ) with being a hyper-parameter.

OPE OF DBS CONTROLLERS INCLUDING PATIENT FEEDBACK AND TREMOR DATA
As discussed in Remark 4.1, besides the reward function introduced in Sec. 4.1, for OPE we employ QoC metrics , , and defined in (3), (4), (5), respectively, which are only available at the end of each session. As these well-capture performance (i.e., therapy effectiveness) of the considered policy, for OPE we additionally consider the end-of-session rewards defined as with 2 , 3 , 4 > 0 real constants. Without loss of generality, we slightly modify the total return under policy (from Problem 2) as where and follow from (8) and (10), respectively. As discussed in Sec. 2.3, the DLMM introduced in [20], falls short in dealing with long horizons and predicting the end-of-session rewards . To address these limitations, in this section we introduce the deep latent sequential model (DLSM) that directly enforces the transitions over the LVS. The overall model architecture is shown in Fig. 5. First, the latent prior ( 0 ) is defined only over the initial latent variable at step = 0, 0 , which follows a multivariate Gaussian distribution with zero mean and identity covariance matrix.
Then, the encoder (approximated posterior) is defined over each trajectory (from = 0 to ) as Further, the second term ( | −1 , −1 , ), which enforces the transitions between −1 and conditioned on ( −1 , ) and enables the encoder to capture the dynamical transitions in the LVS, can be obtained iteratively following here, ( 0 | 0 ) and ( |ℎ ) are parameterized by multivariate diagonal Gaussian distributions, each with mean and covariance determined by a feedforward DNN [6]; moreover, ℎ is the hidden state of a recurrent DNN, such as long short-term memory (LSTM) [27], capturing the historical transitions among , and for all past steps up until − 1 within each trajectory. The decoder (sampling distribution) is responsible for interacting with the target policies to be evaluated, from which the expected returns can be estimated as the mean return obtained by the simulated trajectories. Specifically, the decoder is defined as follows, i.e., here, ( | ) estimates the end-of-session rewards given the latent variable at = , ; ( | ), ( −1 | ) reconstruct the states and rewards; ( | −1 , −1 ) enforces the transitions over the latent variables, 's, conditioned on the actions; and 0 ∼ ( 0 ) is sampled from the prior. As a result, each simulated trajectory can be generated by the decoder following here, ℎ is the hidden state of a recurrent DNN; ( |ℎ ), ( | ), ( −1 | ) and ( | ) are multivariate diagonal Gaussians with means and covariances determined by four feedforward DNNs separately. Hence, 's and −1 's can be sampled iteratively following the process above, using the actions obtained from the target policy −1 ∼ ( −1 | −1 ) accordingly, which constitute the simulated trajectories; and is sampled at the end of each simulated trajectory.
The theorem below derives an ELBO for the joint log-likelihood log ( 0: , 0: −1 , ), following the above DLSM architecture. ) can be obtained as ≤ log ( 0: , 0: −1 , ); here, the first three terms are the log-likelihood of the decoder to reconstruct , −1 and correctly, and the two terms that follow regularize the transitions captured by the encoder over the LVS, with (·||·) being the Kullback-Leibler (KL) divergence [35].
The proof of Theorem 5.1 can be found in Appendix F. Empirically, similar to the DLMM [20], the ELBO can be evaluated using the trajectories from the experience replay E , by replacing the expectation as the mean over all trajectories, after which the objective max , L ( , ) can be achieved using gradient descent [31] following the algorithm in Appendix D. Moreover, the reparameterization trick [32] is used, which allows for the gradients to be back-propagated when sampling from Gaussian distributions with means and covariances determined by DNNs. Details on reparameterization can be found in [20,32].

CLINICAL EVALUATIONS
Using our closed-loop DBS system presented in Sec. 3, we evaluated the developed RL-based control framework in clinical trials on four PD patients, at Duke University Medical Center. In particular, we evaluated and compared four different types of controllers: cDBS, RL, RL with policy distillation (i.e., distilled RL), and no-DBS (i.e., without stimulation). The electrodes of the DBS device were placed in STN and GPi brain regions for all four participants; LFPs were sensed from STN and stimuli were delivered to both STN and GP.
Each participant also has had different PD symptoms and severity; their characteristics are summarized in Appendix E. All trials were conducted under close supervision of clinical experts, strictly following the process approved by the Duke University Medical Center IRB protocol complying with the obtained FDA IDE (G180280). Further, all participants provided informed written consent.

Therapy Efficacy and Energy-Efficiency of the RL Control Policies
We follow the offline RL and policy distillation methodology introduced in Sec. 4 to train and update (distilled) RL policies iteratively over time. Specifically, each participant had monthly clinical visit, where during each day of trials a total of 2-4 RL policies would be tested. A cDBS session was placed in between any two RL sessions as a control group. A small number of no-DBS sessions, with DBS stimulation fully off, were also tested, to validate our choice of the employed QoCs metrics -i.e., whether they significantly change when the participants are not stimulated. After each trial day was completed, the trajectories collected from all the sessions were added to the experience replay buffer E unique to each participant. Between two consecutive visits of each participant, her E was used to fine-tune the top-performing policies determined from the last trial (using smaller learning rates between [10 −7 , 10 −5 ]) or to train new policies from scratch (with learning rates between [10 −5 , 10 −3 ]); such policies were then tested in the next visit. We followed [39] and used two-layer NNs with 400 and 300 nodes each to parameterize the RL policies; moreover, a distilled version (student) of each corresponding full-sized RL policy (teacher) were trained as introduced in Sec. 4.2, with each represented as a two-layer NN with 20 and 10 nodes. The constants in (8) were set to = 0, = −1, 1 = 0.3 for all participants.
In each testing session, to evaluate the overall performance of the employed control policy, a total of 5 metrics were considered: the energy used by the IPG for stimulation, the mean beta amplitude over Figure 6: Quality of control (QoC) results from all clinical trials across participants. Wilcoxon rank-sum tests [43] between cDBS and each of the other controllers are used to test the null hypothesis that two sets of measurements are drawn from the same distribution, resulting in the -values reported above. The null hypothesis is rejected when consider the stimulation energy consumed by both RL controllers, illustrating that they lead to significant energy reduction compared to cDBS. For all other QoCs, the null hypothesis is accepted in majority cases, showing that both RL controllers can in general attain similar control efficacy to cDBS. The controllers that lead to the acceptance/rejection of the null hypothesis in the desired direction are highlighted with asterisks and bold -values. Participant 1 84  97  97  36  Participant 2 145  80  182  52  Participant 3 135  115 115  39  Participant 4 124  119 98  48   Table 1: Overall time, in minutes, spent toward testing each type of controller in clinical trials. Each testing session lasted 5-20 minutes, and no-DBS sessions were usually 5min long to minimize the discomfort participants may experience.

cDBS RL Distilled RL No-DBS
the session, and the 3 QoCs introduced in Sec. 3; for , we captured the grasp frequencies of the hand that best correlates with the PD symptom for the participant (see Appendix E for details). Fig. 6 summarizes the obtained results, and Table 1 documents the total amount of time each controller was tested in clinic. Wilcoxon rank-sum tests [43] between cDBS and each of the other controllers were used to test the null hypothesis -if two sets of measurements were drawn from the same distribution (i.e., that the controllers perform similarly over the considered metrics); from this, -values can be calculated. The -values accepting/rejecting the null hypothesis in the desired direction are highlighted in Fig. 6. Specifically, it can be observed that, compared to cDBS, the RL policies and their distilled version can save significant (20%-55%) stimulation energy across participants; as < .05 achieved for all participants, which rejected the null hypothesis.
When considering the other 4 metrics, there exist a great majority of results with ≥ .05, accepting the null hypothesis and indicating that both RL controllers attain control (i.e., therapy) efficacy similar to cDBS. In contrast, for the no-DBS sessions, the null hypothesis is rejected in most cases. Specifically, < .05 attained by no-DBS over the mean beta amplitude, for all participants, show that beta amplitudes can change significantly when sufficient DBS is received or not, which justify our choice of using the beta amplitudes to constitute MDP states. This also shows that the RL policies can follow the reward function (from Sec. 4.1) to effectively optimize the control strategies, with beta amplitudes also playing an important role. Consequently, the results show that both full and distilled RL policies can significantly reduce the stimulation energy, while achieving non-inferior control efficacy compared to cDBS.

RL Distilled RL Random Controller
Battery Runtime (m) 227 ± 5 220 ± 6 247 ± 4 Table 3: Overall battery runtime of the DBS system when the RL, distilled RL or random controllers were used.

Computational Complexity and Overall Energy Consumption.
We also study the additional computation time and battery consumption of the DBS system due the use of full-sized RL policies or their distilled version. A Surface Go with an Intel Pentium Gold 4415Y CPU and 4GB RAM was used as the research tablet in Fig. 4.
The computation time was quantified as the time needed to run a single forward pass of the NN that represents the RL policy. We evaluate the forward passes for both types of RL policies 200 times; Table 2 summarizes themean and standard deviation of the obtained computation times. As can be seen, the distilled RL policy can be evaluated significantly faster than its counterpart. Moreover, we quantify the overall battery consumption of the entire DBS system as the time for which the tablet or the IPG battery drains from 100% to 10% (whichever comes first). We compare the battery runtime among the full RL and distilled RL, as well as a random controller that sets the IPG to stimulate with an arbitrary amplitude in each control cycle. Each experiment was repeated 3 times, resulting in the statistics in Table 3 showing that the two RL-based controllers do not drastically shorten the runtime of the DBS system; i.e., the energy used for RL-based control does not dominate the overall energy used by the DBS system.

Evaluation of the OPE Methodology
For each participant, a DLSM was trained following the methodology introduced in Sec. 5, and then used as a synthetic environment to interact with 6 policies trained using the deep actor-critic method (Sec. 4) with different hyper-parameters, over the buffer E specific to the patient; these policies can in general lead to varying performance. Then, for each policy, the mean of total returns (11) over all simulated trajectories can be calculated, and was used to estimate the policy's expected return from Problem 2. The constants in (10), balancing the scale of the QoCs (i.e., grasp frequency, rating and tremor severity) were set to 2 = 3 = 4 = 10 for patients 2-4 who can experience bradykinesia and pronounced tremor with insufficient DBS; in contrast, the symptoms of participant 1 are considered subtle, so we set 2 = 3 = 4 = 25 to better distinguish if sufficient DBS is provided; see Appendix E for details on patient characteristics as well as the dosage of PD medications.
DLSM's performance was compared against the classic IS [54], as well as a state-of-the-art IS-based OPE method, dual-DICE [47]. Three metrics were considered to evaluate the performance of OPE, including mean absolute error (MAE), rank correlation, and re-gret@1, following from [14]. MAE evaluates the absolute error between the total return estimated by OPE, versus the actual returns, i.e., mean total return recorded from clinical trials. Rank correlation quantifies the alignment between the rank of policies over OPE-estimated returns and the actual returns. Regret@1 quantifies the percentage loss, over the total actual returns, one would get by picking the policy with maximum OPE-estimated return, against the actual best-performing policy, showing if the OPE methods can identify the best-performing policy correctly. Their mathematical definitions can be found in Appendix G.
The obtained results are summarized in Fig. 7. As shown, the DLSM in general achieved significantly higher rank and lower regret, as well as non-inferior MAE, over DICE and IS.

CONCLUSION
In this paper, we introduced an offline RL and OPE framework to design and evaluate closed-loop DBS controllers using only historical data. Moreover, a policy distillation method was introduced to further reduce the computation requirements for evaluating RL policies. The control efficacy and energy efficiency of the RL controllers were validated with clinical testing over 4 patients. Results showed that RL-based controllers lead to similar control efficacy as cDBS, but with significantly reduced stimulation energy. The computation times for the RL and distilled RL controllers were compared, showing that the distilled version executed significantly faster; future work will focus on further reducing execution times of the distilled RL controllers to match capabilities of implanted devices. Finally, the DLSM is trained to estimate the expected returns of RL policies, which outperforms existing IS-based OPE methods, in terms of rank correlations, regrets and MAEs.
distributions with mean and covariance matrix Σ over variable . For simplicity, we write ∼ N ( , Σ) during sampling. The KLdivergence between distributions ( ) and ( ) is defined as

B AVAILABILITY OF DATA AND CODE
We plan to open-source the data collected from clinical testing, as well as the implementation for training RL policies, in the future 1 , to facilitate research in developing RL-based DBS controllers. The RC+S system as well as its Summit code base are considered proprietary, which may not be published online. The implementation of our OPE method, DLSM, is built on our previous works [17,20], with code published at https://github.com/gaoqitong/vlbm.

C ADDITIONAL PRELIMINARIES
Below we introduce in details the preliminaries needed to supplement Sec. 2.

C.1 Deep Actor-Critic RL
We now briefly introduce the deep actor-critic algorithm [39] and refer the readers to [19,20,39] for more details. First, the stateaction value functions can be defined as follows.
Definition C.1 (State-Action Value Function). Given an MDP M and policy , the state-action value function ( , ), where ∈ S and ∈ A, is defined as the expected return for taking action when at state following policy at stage , i.e., Two neural networks, with weights and , can be used to parameterize the policy (actor) ( ) : S → A and the Q-functions (critic) ( , ) : S × A → R, respectively. Finally, the target policy * = * can be obtained by optimizing over this can be achieved using gradient descent, over all the training samples in the experience replay buffer E [39].

C.2 Deep Latent MDP Model (DMLL)
The DLMM is trained to fit the transitions of the MDP ( +1 | , ) and rewards = ( , ), which consist of three components, i.e., a latent prior, posterior and sampling distribution. The prior ( ) is parameterized by , over the latent variable space (LVS) Z ⊂ R , where ∈ Z + is the dimension. The prior represents one's belief over the latent distribution of the states (i.e., probability density function over the latent variables) which is considered unknown; thus, it is usually chosen to be a multivariate Gaussian with zero mean and identity covariance. Then, the encoder (or approximated posterior) ( | ) is parameterized by , which is responsible for encoding the MDP state ∈ S into the LVS, Z. Note that the true posterior ( | ) is intractable, since its density function contains integration over Z which is deemed unknown; we refer to [20] for more details. Lastly, the decoder (or sampling 1 After finalizing a journal submission built on top of this work. Sample a trajectory [ ( 0 , 0 , 0 , 1 ), . . . , Run forward pass of DLSM following (13) and (15) here,ˆ,ˆ+ 1 andˆrepresent the states and rewards predicted from DLMM. Consequently, the expected return of can be estimated as 1 where is the total number of simulated trajectories generated following the process above, andˆ( ) is the predicted reward at step in the -th simulated trajectory. To train DLMM, one can maximize the evidence lower bound (ELBO) of the joint log-likelihood =0 log ( +1 , ), where the derivation of the ELBO for DLMM can be found in [20]. From (21), it can be observed that the predictedˆ+ 1 ,ˆare conditioned on 's, which are dependent on the predicted stateˆfrom the last step. As a result, such an iterative process may not scale well to environments with longer horizons and more complicated dynamics, as the prediction error from all earlier steps are propagated into the future steps. Moreover, this DLMM is not capable of predicting the QoC metrics that are only evaluated once at the end of each session, including the bradykinesia results, patient ratings and tremor severity as discussed in Sec. 3. To address such limitations, we introduce a new latent modeling method in Sec. 5, which decouples the dependencies between 's andˆ's by directly enforcing the temporal transitions over latent variables, i.e., ( +1 | , ).

D ALGORITHM TO TRAIN DLSM
Here we introduce how to use gradient descent to maximize the ELBO (16), resulting in Algorithm 1.For simplicity, we first illustrate with the case where the training batch only contains a single trajectory, and then extend to the cases where each batch contain trajectories. In each iteration, a trajectory is sampled from the experience replay buffer E . Then, the initial latent state in the encoder is obtained following 0 ∼ ( 0 | 0 ), while the initial latent state for the sampling distribution is generated following the latent prior 0 ∼ ( 0 ). The processes introduced in (13) can be used to generate 's iteratively given 0 . Similarly, , , can be generated iteratively following (15). As a result, the log-likelihoods and KL-divergence terms within the expectation in L , defined in (16), can be evaluated using the variables above, after which , can be updated using the gradients ∇L , ∇L , respectively, whereL refers to all the terms within the expectation in L . This algorithm is summarized in Alg. 1. To extend to batch gradient descent, in line 3 of Algorithm 1, a batch of trajectories, B ( ), will be sampled, i.e.,

E PARTICIPANT CHARACTERISTICS
Participant 1. Episodes of tremor only. Bradykinesia in the left hand correlates with right STN beta amplitude. STN beta amplitudes in both hemispheres correlate with stimulation amplitude. This participant takes 2 tablets of Sinemet 25mg/100mg for each clinical visit, one at 2 hrs before the testing begins ( 7am) and another in the middle of the visit (around noon) respectively. Participant 2. Very large amplitude tremor returns within seconds of low amplitude DBS. Bradykinesia in right hand highly correlated with left STN beta amplitude. Left STN beta amplitude also correlated with DBS amplitude. This participant takes Flexeril 10mg, Selegiline 5mg and Pramipexole 0.125mg (1 tablet for each) at 2 hrs before the testing begins. Participant 3. Pronounced tremor (hands and jaw) returns within seconds of low amplitude closed-loop DBS. Bradykinesia in right hand correlates with left STN beta amplitude. Left STN beta amplitude is the most responsive to stimulation of the cohort. This participant takes 2 tablets of Sinemet 25mg/100mg for each clinical visit, one at 2 hrs before the testing begins ( 7am) and another in the middle of the visit (around noon) respectively.
Regret@1. Regret@1 is the (normalized) difference between value of the actual best policy, against value of the policy associated with the best OPE-estimated return, which is defined as (max ∈1: − max ∈best(1: ) )/max ∈1: where best(1 : ) denotes the index of the best policy over the set of policies as measured by estimated valuesˆ.
Mean Absolute error (MAE). MAE is defined as the absolute difference between the actual return and estimated return of a policy: = | −ˆ|; here, is the actual value of the policy , and is the estimated value of .